During the lab, Hariprasath focused on assignment 1 and Lakshidaa focused on assignment 2. After the lab, both of us collaborated to complete the remaining tasks and to clarify each other’s doubts. Then we discussed together and answered the questions related to analysis and interpretation of the plots.
# Loading plotly (provides plot_mapbox, plot_geo and the %>% pipe)
library(plotly)
# Reading mosquito data
aegypti_data = read.csv("aegypti_albopictus.csv")
Firstly, we plot a dot map of the distribution of the 2 types of mosquitoes in the world in 2004.
# 2004
plot_mapbox(aegypti_data[aegypti_data$YEAR == 2004,], lat = ~Y, lon = ~X) %>%
  add_markers(color = ~VECTOR, hoverinfo = "text",
              text = ~paste(COUNTRY), alpha = 0.7) %>%
  layout(title = "Dot map of mosquito distribution in the world (2004)")
Secondly, we plot a dot map of the distribution of the 2 types of mosquitoes in the world in 2013.
# 2013
plot_mapbox(aegypti_data[aegypti_data$YEAR == 2013,], lat = ~Y, lon = ~X) %>%
  add_markers(color = ~VECTOR, hoverinfo = "text",
              text = ~paste(COUNTRY), alpha = 0.6) %>%
  layout(title = "Dot map of mosquito distribution in the world (2013)")
In 2004, Aedes albopictus observations were concentrated in Taiwan and in the south-eastern USA, while Aedes aegypti observations were concentrated in parts of South and south-east Asia (Indonesia, India, Thailand) and in some coastal areas of South America (Brazil, Venezuela). In 2013, Aedes albopictus was concentrated in Taiwan and, to some extent, in Italy, while Aedes aegypti was heavily concentrated in Brazil. Comparing 2004 with 2013, Aedes aegypti observations became far denser in Brazil and Aedes albopictus remained concentrated in Taiwan. In most other countries, the observations of both mosquito types seen in 2004 no longer appear in 2013.
The dataset contains a large number of records, and they are concentrated in a few regions/countries (mainly Brazil and Taiwan). This causes overplotting: markers are drawn on top of each other, so it is difficult for the viewer to get an accurate idea of the number of mosquitoes in those countries. This is the perception problem observed in the dot maps.
We plot a choropleth map showing the number of mosquitoes in each country during the entire study period. The map projection used is equirectangular.
plot_geo(country_aggregate) %>%
  add_trace(
    z = ~Count,
    text = ~COUNTRY, locations = ~COUNTRY_ID
  ) %>%
  layout(
    title = "Choropleth plot of number of mosquitoes",
    geo = g_equirec
  )
Taiwan had by far the largest number of mosquitoes in the study period (24837), while the second largest value was that of Brazil at 8501. These two values stretch the color scale, so almost all other countries fall into a much narrower range of colors and are difficult to tell apart. As a result, the map shows very little information.
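The choropleth code uses `country_aggregate` and `g_equirec`, which are not shown in this report. The following is only a minimal sketch of how they might be built, assuming one row per observation, a `COUNTRY_ID` column holding ISO-3 codes, and `Count` defined as the number of records per country. The toy `records` data frame is hypothetical.

```r
# Toy records standing in for aegypti_data (hypothetical values).
records <- data.frame(
  COUNTRY    = c("Brazil", "Brazil", "Taiwan", "Taiwan", "Taiwan", "Italy"),
  COUNTRY_ID = c("BRA", "BRA", "TWN", "TWN", "TWN", "ITA"),
  stringsAsFactors = FALSE
)

# One row per country, with the number of records stored in Count.
records$Count <- 1
country_aggregate <- aggregate(Count ~ COUNTRY + COUNTRY_ID,
                               data = records, FUN = sum)

# Assumed geo settings for layout(geo = ...); the type strings are
# plotly's names for the two projections compared in this report.
g_equirec <- list(projection = list(type = "equirectangular"))
g_eqarea  <- list(projection = list(type = "conic equal area"))

print(country_aggregate)
```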
Firstly, we make a choropleth plot of the log(Number of mosquitoes) during the entire study period using equirectangular projection.
plot_geo(country_aggregate) %>%
  add_trace(
    z = ~log(Count),
    text = ~paste(COUNTRY, "\n Count: ", Count), locations = ~COUNTRY_ID,
    hoverinfo = "text"
  ) %>%
  layout(
    title = "Choropleth plot of log(number of mosquitoes)",
    geo = g_equirec
  )
After using a log transformation, we are able to see more information in the map. Apart from the large values in Taiwan and Brazil, we observe that there were considerable values in the USA, south-east Asia, Australia, Italy, Mexico and other South American countries as well.
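To illustrate why the log transformation helps, the sketch below compares where a country would sit on a color scale running from the smallest to the largest value, before and after taking logs. The Taiwan and Brazil counts are the ones reported above; `CountryA` and `CountryB` are hypothetical small counts added for illustration.

```r
# Taiwan and Brazil are from the report; CountryA/CountryB are hypothetical.
counts <- c(Taiwan = 24837, Brazil = 8501, CountryA = 100, CountryB = 10)

# Position on a color scale running from min (0) to max (1).
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))

linear_pos <- rescale01(counts)
log_pos    <- rescale01(log(counts))

print(round(cbind(linear_pos, log_pos), 3))
```

On the linear scale the two small counts are practically indistinguishable from the minimum, while on the log scale they are spread over a visible part of the color range.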
Secondly, we make a choropleth plot of the log(Number of mosquitoes) during the entire study period using conic equal area projection.
plot_geo(country_aggregate) %>%
  add_trace(
    z = ~log(Count),
    text = ~paste(COUNTRY, "\n Count: ", Count), locations = ~COUNTRY_ID,
    hoverinfo = "text"
  ) %>%
  layout(
    title = "Choropleth plot of log(number of mosquitoes)",
    geo = g_eqarea
  )
The equirectangular projection preserves neither angles nor areas. The conic equal area projection preserves areas but not angles, and its shape distortion is smallest in the middle latitudes.
Equirectangular projection: The advantage of this projection is that it is rectangular, which makes it more straightforward to identify and analyze the countries. The disadvantage is that the map is heavily distorted away from the equator, which could be misleading when conclusions are based on area comparisons.
Conic equal area projection: The advantage is that countries which extend from east to west are well represented in the map and that areas are comparable. The disadvantage is that the shapes of countries which extend from north to south are distorted, which could be misleading.
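The area distortion of the equirectangular projection can be quantified. Assuming the standard parallel is the equator, distances along a meridian are kept while distances along a parallel are stretched by 1/cos(latitude), so areas are inflated by the same factor. A small sketch:

```r
# Area inflation of the equirectangular projection (standard parallel
# at the equator): east-west stretch is 1 / cos(latitude), north-south
# scale is 1, so areas are inflated by 1 / cos(latitude).
lat_deg <- c(0, 30, 60, 75)
area_inflation <- 1 / cos(lat_deg * pi / 180)

print(data.frame(lat_deg, area_inflation = round(area_inflation, 2)))
```

At 60 degrees latitude a region is drawn with twice its true relative area, which is why area comparisons between equatorial and high-latitude countries are unreliable on this map.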
We plot a dot map of the number of mosquitoes in Brazil during 2013 by splitting the area into a 100x100 grid.
plot_mapbox(brazil_grid_data, lat = ~Mean_Y, lon = ~Mean_X) %>%
  add_markers(color = ~N) %>%
  layout(title = "Dot map of number of mosquitoes in Brazil (2013)",
         mapbox = list(zoom = 2.7,
                       center = list(lat = ~median(Mean_Y, na.rm = TRUE),
                                     lon = ~median(Mean_X, na.rm = TRUE))))
## Warning: Ignoring 1 observations
Yes, discretizing the distribution of mosquitoes using a grid helped in analyzing the occurrences. We can clearly observe that the far eastern and the south-eastern parts of Brazil were most affected by mosquitoes.
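The construction of `brazil_grid_data` is not shown in this report. One possible way to obtain the `Mean_X`, `Mean_Y` and `N` columns used in the plot is sketched below with base R: each coordinate is cut into 100 intervals and the records are summarised per occupied cell. The random points in `pts` are hypothetical and only stand in for the 2013 Brazil records.

```r
set.seed(1)
# Hypothetical point records (longitude X, latitude Y).
pts <- data.frame(X = runif(500, -70, -35), Y = runif(500, -30, 5))

# Split each axis into 100 equal-width intervals (a 100x100 grid).
n_cells <- 100
pts$cell_x <- cut(pts$X, breaks = n_cells, labels = FALSE)
pts$cell_y <- cut(pts$Y, breaks = n_cells, labels = FALSE)

# Mean coordinates per occupied cell.
means <- aggregate(cbind(X, Y) ~ cell_x + cell_y, data = pts, FUN = mean)
names(means)[3:4] <- c("Mean_X", "Mean_Y")

# Number of records per occupied cell.
counts <- aggregate(X ~ cell_x + cell_y, data = pts, FUN = length)
names(counts)[3] <- "N"

brazil_grid_data <- merge(means, counts, by = c("cell_x", "cell_y"))
print(head(brazil_grid_data))
```

Each marker then represents one grid cell, and its color encodes how many records fell into that cell, which avoids the overplotting seen in the raw dot maps.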
# Reading income data and map of Swedish counties
swe_data = read.csv("000000KD.csv")
rds = readRDS('gadm36_SWE_1_sf.rds')
We loaded the map of Swedish counties and reshaped the income data so that each county has one row with the mean income of 3 age groups - Young, Adult, Senior.
head(swe_data_processed)
## region Young Adult Senior
## Stockholm Stockholm 385.4 659.5 683.3
## Uppsala Uppsala 300.9 542.7 580.4
## Södermanland Södermanland 317.5 489.4 507.6
## Östergötland Östergötland 290.3 502.7 532.8
## Jönköping Jönköping 330.8 518.8 556.7
## Kronoberg Kronoberg 307.3 503.2 530.3
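The reshaping step that produces `swe_data_processed` is not shown. A minimal sketch of one way to do it with base R is given below, assuming the raw data is in long format with the columns `region`, `age` (coded `18_29`, `30_49`, `50_69`) and `X2016`. The toy data frame `long` reuses the Stockholm and Uppsala values from the output above.

```r
# Long-format toy data: one row per county and age group.
long <- data.frame(
  region = rep(c("Stockholm", "Uppsala"), each = 3),
  age    = rep(c("18_29", "30_49", "50_69"), times = 2),
  X2016  = c(385.4, 659.5, 683.3, 300.9, 542.7, 580.4),
  stringsAsFactors = FALSE
)

# Wide format: one row per county, one income column per age group.
wide <- reshape(long, idvar = "region", timevar = "age", direction = "wide")

# Columns follow the order in which the age codes first appear.
names(wide) <- c("region", "Young", "Adult", "Senior")

# County names as row names, so rows can be looked up by name.
rownames(wide) <- wide$region
print(wide)
```

Using the county names as row names is what allows a lookup such as `swe_data_processed[rds$NAME_1, "Young"]` when joining the incomes to the map.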
Violin plots showing mean income distributions per age group.
ggplot(swe_data) +
  geom_violin(aes(x = age, y = X2016),
              draw_quantiles = c(0.25, 0.5, 0.75), fill = "#A982F7") +
  scale_x_discrete(labels = c("18_29" = "Young",
                              "30_49" = "Adult",
                              "50_69" = "Senior"),
                   name = "Age Group") +
  ylab("Mean Income (SEK thousands)") +
  ggtitle("Distribution of mean incomes in different age groups")
From the plot, it can be seen that in 2016 the Young age group had the lowest mean incomes of the three age groups. Most counties have a mean income of approximately 300 (SEK thousands) for the Young age group, and the maximum for this age group is around 385.
The mean income for the Adult age group ranges approximately from 455 to 660, and most counties have a mean income of about 480. The mean income for the Senior age group ranges approximately from 470 to 683, and most counties have a mean income of about 490.
The distributions of mean incomes for the Senior and Adult age groups are similar, while the Young age group has much lower values. All 3 distributions appear to be skewed to the right.
Surface plot showing dependence of Senior incomes on Adult and Young incomes in various counties.
# interp() is assumed to come from the akima package
s = interp(swe_data_processed$Young, swe_data_processed$Adult,
           swe_data_processed$Senior, duplicate = "mean")
plot_ly(x = ~s$x, y = ~s$y, z = ~s$z, type = "surface") %>%
  layout(scene = list(
    xaxis = list(title = "Young"),
    yaxis = list(title = "Adult"),
    zaxis = list(title = "Senior")
  ))
From this surface plot, the Senior income appears to increase as the Young and Adult incomes increase. The surface also looks approximately like a plane, which suggests that a linear regression of Senior incomes on Young and Adult incomes would be a reasonable model.
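As a rough illustration of the linear model suggested above, the sketch below fits `Senior ~ Young + Adult` with `lm()` on the six counties shown in the `head()` output earlier. The data frame name `incomes` is hypothetical, and with only six rows this is a demonstration of the call rather than a result for all counties.

```r
# The six counties printed by head(swe_data_processed).
incomes <- data.frame(
  Young  = c(385.4, 300.9, 317.5, 290.3, 330.8, 307.3),
  Adult  = c(659.5, 542.7, 489.4, 502.7, 518.8, 503.2),
  Senior = c(683.3, 580.4, 507.6, 532.8, 556.7, 530.3)
)

# Plane through the points: Senior = b0 + b1 * Young + b2 * Adult.
fit <- lm(Senior ~ Young + Adult, data = incomes)

print(coef(fit))
print(summary(fit)$r.squared)
```

Fitting the same formula to the full `swe_data_processed` table would show how well a plane describes the surface for all counties.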
rds$Young = swe_data_processed[rds$NAME_1, "Young"]
plot_ly() %>%
  add_sf(data = rds, split = ~NAME_1,
         color = ~Young, showlegend = FALSE, alpha = 1) %>%
  layout(title = "Choropleth plot of mean income of Young age group")